This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.
You can find the complete handbook on Github
This page demonstrates common steps necessary to clean a dataset. It uses a simulated Ebola case linelist, which is used throughout the handbook.
HOW TO READ: To emphasize the tidyverse coding approach, each cleaning step is explained individually and then incorporated into a “cleaning pipeline” - a series of cleaning actions linked together sequentially through pipes (LINK TO PIPES). The pipe begins with the “raw” data (linelist_raw) and ends with a “clean” dataset (linelist).
The cleaning steps demonstrated include:
case_when())replace missing with dealing with cases (all lower, etc) case_when() factors
Import the raw dataset using the import() function from the package rio. (LINK HERE TO IMPORT PAGE)
You can view the original raw dataset below:
Text here about a cleaning pipeline of dplyr verbs… order is important
Variable names are used so often, it is best that they have “clean” syntax. We suggest the following:
The names of linelist_raw are below. We can see that there are some with spaces. We also have different naming patterns for dates (‘date onset’ and ‘infection date’).
names(linelist_raw)
## [1] "case_id" "generation" "infection date" "date onset"
## [5] "hosp date" "date_of_outcome" "outcome" "gender"
## [9] "hospital" "lon" "lat" "infector"
## [13] "source" "age" "age_unit" "fever"
## [17] "chills" "cough" "aches" "vomit"Note: To use a variable names that include spaces, surround the name with back-ticks, for example: linelist$`infection date`
On a keyboard, the back-tick (`) is different from the single quotation mark ('), and is sometimes on the same key as the tilde (~).
The function clean_names() from the package janitor is very useful. Here is an online vignette
# send the dataset through the function clean_names()
linelist <- linelist_raw %>%
janitor::clean_names()
# see the new names
names(linelist)
## [1] "case_id" "generation" "infection_date" "date_onset"
## [5] "hosp_date" "date_of_outcome" "outcome" "gender"
## [9] "hospital" "lon" "lat" "infector"
## [13] "source" "age" "age_unit" "fever"
## [17] "chills" "cough" "aches" "vomit"Re-naming variables manually is often necessary. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style “NEW = OLD”, the new variable name is given before the old variable name.
linelist <- linelist_raw %>%
# standardize variable name syntax
janitor::clean_names() %>%
# manually re-name variables
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome)Now you can see that the variables names have been changed:
Often the first step of cleaning data is selecting the variables you want to work with, and their order in the dataframe. In a dplyr chain of verbs, this is done with select(). Note that in these examples we modify linelist with select(), but do not assign/overwrite. We just display the resulting new column names, for purpose of example.
Within select() you can do the following:
# linelist dataset is piped through select() command, and prints just the variable names
linelist %>%
select(case_id, date_onset, date_hospitalisation, fever) %>%
names()
## [1] "case_id" "date_onset" "date_hospitalisation"
## [4] "fever"select() which variables to remove by placing a minus symbol “-” in front of the variable name (or a vector of variable names). It will keep all others. Inside select() you can use normal operators such as : for consecutive values, c() to list values/variables, ! for opposite, & for AND, and | for OR.linelist %>%
select(-c(fever:vomit)) %>%
names()
## [1] "case_id" "generation" "date_infection"
## [4] "date_onset" "date_hospitalisation" "date_outcome"
## [7] "outcome" "gender" "hospital"
## [10] "lon" "lat" "infector"
## [13] "source" "age" "age_unit"everything() to signify all other variables not specified:linelist %>%
select(case_id, date_onset, date_hospitalisation, fever, everything()) %>%
names()
## [1] "case_id" "date_onset" "date_hospitalisation"
## [4] "fever" "generation" "date_infection"
## [7] "date_outcome" "outcome" "gender"
## [10] "hospital" "lon" "lat"
## [13] "infector" "source" "age"
## [16] "age_unit" "chills" "cough"
## [19] "aches" "vomit"As well as everything() there are several special functions that work within select(), namely:
everything() - all other variables not mentioned
last_col() - the last column
starts_with() - matches to a specified prefix. Example: select(starts_with("date"))
ends_with() - matches to a specified suffix. Example: select(ends_with("_end"))
contains() - variables containing a character string. Example: select(contains("time"))
matches() - to apply a regular expression (regex). Example: select(contains("[pt]al"))
num_range() -
any_of() - matches if variable is named. Useful if name might not exist. Example: select(any_of(date_onset, date_death, cardiac_arrest))
where() - applies a function to all variables and selects those which are TRUE
select as a standalone command TO DO
See section on object classes
Here we want to ensure that the class of each variable is appropriate, so we’ll add it to our cleaning pipe chain.
The class of the “age” variable is character. To perform analysis, we need those numbers to be recognized as numeric!
The class of the “date_onset” variable is also character! To perform analysis, these dates must be recognized as dates!
Use table() or another method to see all the values, can see that we see that one date was entered in a different format (15 April 2014) than all the others!
##
## 15 April 2014 2014-04-07 2014-04-21 2014-04-25 2014-04-26
## 1 1 2 1 1
## 2014-04-27
## 1
This means before we can classify “date_onset” as a date, we must fix this value. We can do this using mutate() and recode() in our cleaning pipe chain, before the commands to convert to class Date. LINK TO CLASSIFYING VARIABLE AS DATE.
The new mutate line can be read as: mutate date_onset to equal date_onset recoded so that OLD VALUE is changed to NEW VALUE. Note that this pattern (OLD = NEW) is the opposite of most R patterns. The R development community is working on revising this.
linelist <- linelist_raw %>%
# standardize variable name syntax
janitor::clean_names() %>%
# manually re-name variables
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
# fix incorrect values
# old value # new value
mutate(date_onset = recode(date_onset, "15 April 2014" = "2014-04-15")) %>%
# correct the class of the variables
mutate(age = as.numeric(age),
date_onset = as.Date(date_onset, format = "%Y-%m-%d"))Especially after converting to class date, check your data visually or with table() to confirm that they were converted correctly! The date format = entered is often a source of problems.
class(linelist$date_infection)
## [1] "POSIXct" "POSIXt"
head(linelist$date_infection)
## [1] "2014-04-09 UTC" NA NA "2014-05-07 UTC"
## [5] NA "2014-05-06 UTC"You can use The dplyr function across() with mutate() to convert several variables at once to a new class. across() allows you to specify which columns you want a function to apply to. Below, we want to mutate the variables where is.POSIXct() is TRUE, and apply the function is.Date() to them, in order to convert them to class “date”.
across() we also use the function where().is.character(), is.numeric(), and is.logical() are from base R)across() are written without empty parenthesesAfter selecting columns, a typical cleaning step is to filter the dataframe for specific rows using the dplyr verb filter()
filter(dataset, criteria) OR subset like: dataset_new <- dataset[criteria,criteria]
We advise creating new variables with dplyr functions as part of a chain of such verb functions (e.g. filter, mutate, etc.)
If in need of a stand-alone command, you can use the base R style to create a new variable.
As explained in the section on dplyr and tidyverse coding style (LINK HERE), a chain of ‘verb’ functions operate on a dataset through ‘pipes’ (%>%), passing the output from one verb to the next. The verb mutate() used to add a new variable or modify an existing one.
Below are some example of creating new variables with mutate(). The syntax is: new_variable_name = value or function. It is best practice to separate each new variable with a comma and new line.
linelist <- linelist %>% # creating new, or modifying old dataset
mutate(new_var_dup = case_id, # new variable = duplicate/copy another variable
new_var_static = 7, # new variable = all values the same
new_var_static = new_var_static + 5, # you can overwrite a variable, and can modify a variable multiple times
new_var_calc = (age / 12), # new variable = a calculation
new_var_paste = paste0(hospital, " (", date_hospitalisation, ")") # new variable = pasting together values from other variables
) Scroll to the right to see the new variables:
# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )TIP: The verb transmute() adds new variables just like mutate() but also drops/removes all other variables that you do not mention.
To recode the values in a variable, mutate() is also used.
In this case, we notice that we need to clean the variable “hospital”. There are several incorrect spelling, and many missing values.
table(linelist$hospital, useNA = "always")
##
## Connaught Hopital
## 47
## Connaught Hospital
## 1715
## Military Hopital
## 30
## Military Hospital
## 786
## Mitylira Hopital
## 1
## Mitylira Hospital
## 79
## other
## 885
## Princess Christian Maternity Hopital (PCMH)
## 11
## Princess Christian Maternity Hospital (PCMH)
## 411
## Rokupa Hopital
## 11
## Rokupa Hospital
## 443
## <NA>
## 1469For to change spellings one-by-one, you can use the recode() function within the mutate function. The code is saying that the variable “hospital” should be defined as the current variable “hospital” but with certain changes (syntax is OLD = NEW). Don’t forget commas!
linelist <- linelist %>%
mutate(hospital = recode(hospital,
# OLD = NEW
"Mitylira Hopital" = "Military Hospital",
"Mitylira Hospital" = "Military Hospital",
"Military Hopital" = "Military Hospital",
"Connaught Hopital" = "Connaught Hospital",
"Rokupa Hopital" = "Rokupa Hospital",
"other" = "Other",
"Princess Christian Maternity Hopital (PCMH)" = "Princess Christian Maternity Hospital (PCMH)"
))
table(linelist$hospital, useNA = "always")
##
## Connaught Hospital
## 1762
## Military Hospital
## 896
## Other
## 885
## Princess Christian Maternity Hospital (PCMH)
## 422
## Rokupa Hospital
## 454
## <NA>
## 1469TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.
TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks with no space inbetween ("").
If you need to use logic statements with multiple variables, or want to use operators like %in%, use case_when() instead, as below. If you use case_when() please read the thorough explanation HERE LINK, as there are important differences from recode() in syntax and logic!
linelist <- linelist %>%
mutate(hospital = case_when(hospital == "Connaught Hopital" ~ "Connaught Hospital",
hospital == "Rokupa Hopital" ~ "Rokupa Hospital",
hospital %in% c("Mitylira Hopital",
"Mitylira Hospital",
"Mitylira Hospital",
"Military Hopital") ~ "Military Hospital",
is.na(hospital) ~ "Missing",
hospital == "Princess Christian Maternity Hopital (PCMH)" ~ "Princess Christian Maternity Hospital (PCMH)",
TRUE ~ hospital)
)
table(linelist$hospital, useNA = "always")
##
## Connaught Hospital
## 1762
## Military Hospital
## 896
## Missing
## 1469
## Other
## 885
## Princess Christian Maternity Hospital (PCMH)
## 422
## Rokupa Hospital
## 454
## <NA>
## 0To change missing values to a character value, such as “Missing”, use the function replace_na() in the same manner as recode above:
Likewise you can quickly convert character values to NA using na_if(), as below:
Intro For simple cases you can use ifelse() or if_else(). In most cases it is better to use case_when().
ifelse() and if_else():
These commands are simplified versions of an if and else statement. The general syntax is ifelse(condition, value if TRUE, value if FALSE). if_else() is a special version from dplyr that handles dates.
Stringing together ifelse statements - NOT ADVISED!! Difficult to read and keep track of.
IMAGE of ifelse string with X across is.
Use case-when() instead.
You can reference other variables with the ifelse() function within mutate():
Missing if… na_if() lead(), lag() cumsum(), cummean(), cummin(), cummax(), cumany(), cumall(),
coalesce()
if_else(), ifelse()
recode CLEANING MISSPELLINGS HOSPITAL NAME
Replace
## load cleaning rules and only keep variables in mll
mll_cleaning_rules <- import(here("dictionaries/mll_cleaning_rules.xlsx")) %>%
filter(variable %in% c(names(mll_raw), ".global"))
## define variables that are not cleand
unchanged <- c(
"epilink_relationship",
"narratives",
"epilink_relationship_detail"
)
mll_clean <- mll_raw %>%
## convert to tibble
as_tibble() %>%
## clean variables using cleaning rules
clean_data(
wordlists = mll_cleaning_rules,
protect = names(.) %in% unchanged
)Using mutate on GROUPED dataframes https://dplyr.tidyverse.org/reference/mutate.html
Taken from website above:
Because mutating expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped mutate:
starwars %>%
select(name, mass, species) %>%
mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
With the grouped equivalent:
starwars %>%
select(name, mass, species) %>%
group_by(species) %>%
mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
The former normalises mass by the global average whereas the latter normalises by the averages within species levels.
If you need to write a stand-alone command using base R (e.g. not part of a chain of dplyr verbs), then you can create a new variable by assigning it a value. In the command below, the variable new_var does not exist until after the command is executed. In this simple example the variable is assigned the static value “new value”, so for all rows the value will be “new value”.
You can also give the new variable a dyanmic value as shown below, or using the case_when() command explained in the next tab.
case_when())TODO tutorial on using case_when()
For example, creating age groups cut()
case_when()
age_categories() (R4Epis package)
by percentile
WHAT TO DO IF AGE IS SPREAD ACROSS TWO VARAIBLES (e.g. numeric age + unit)
Within a group, indicate/convert to the highest value in the group
Santa Clara County example - COVID contact tracing data - classification of multiple phone call records from same person into the highest category. (classify all as the highest of the group)